{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:files_datasets)=\n",
    "# Data Source Examples"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have selected two examples to demonstrate file wrangling concepts: a government survey about drug abuse; and administrative data from the San Francisco Department of Public Health about restaurant inspections. Before we start wrangling, we give an overview of the data scope for these examples (see {numref}`Chapter %s <ch:data_scope>`)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Drug Abuse Warning Network (DAWN) Survey\n",
    "\n",
    "DAWN is a national health-care survey that monitors trends in drug abuse.\n",
    "The survey aims to estimate the\n",
    "impact of drug abuse on the country's health-care system and improve how\n",
    "emergency departments monitor substance abuse crises. DAWN was administered \n",
    "annually from 1998 through 2011 by\n",
    "the [Substance Abuse and Mental Health Services Administration (SAMHSA)](https://www.samhsa.gov/).\n",
    "In 2018, due in part to the opioid epidemic, the DAWN survey was restarted. \n",
    "In this example, we look at the 2011 data, which have been made available through the [SAMHSA\n",
    "Data Archive](https://www.datafiles.samhsa.gov/study-series/drug-abuse-warning-network-dawn-nid13516)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target population consists of all drug-related emergency room visits\n",
    "in the US. These visits are accessed through a frame of emergency rooms in\n",
    "hospitals (and their records). Hospitals are selected for the survey through\n",
    "probability sampling (see {numref}`Chapter %s <ch:theory_datadesign>`), and all\n",
    "drug-related visits to the sampled hospital's emergency room are included in\n",
    "the survey. All types of drug-related visits are included, such as drug misuse,\n",
    "abuse, accidental ingestion, suicide attempts, malicious poisonings, and\n",
    "adverse reactions.  For each visit, the record may contain up to 16 different drugs, including illegal drugs, prescription drugs, and over-the-counter medications. \n",
    "\n",
    "The source file for this dataset is an example of fixed-width formatting that requires external documentation, like a codebook, to decipher. Also, it is a reasonably large file and so motivates the topic of how to find a file's size. And the granularity is unusual because an ER visit, not a person, is the subject of investigation. \n",
    "\n",
    "The San Francisco restaurant files have other characteristics that make them a good example for this chapter."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## San Francisco Restaurant Food Safety\n",
    "\n",
    "The [San Francisco Department of Public Health](https://www.sfdph.org/dph/default2.asp) routinely makes unannounced\n",
    "visits to restaurants and inspects them for food safety.  The inspector\n",
    "calculates a score based on the violations found and provides descriptions\n",
    "of the violations. The target population here is all\n",
    "restaurants in San Francisco. These restaurants are accessed\n",
    "through a frame of restaurant inspections that were conducted between 2013 and\n",
    "2016. Some restaurants have multiple inspections in a year, and not all of the\n",
    "7,000+ restaurants are inspected annually.\n",
    "\n",
    "Food safety scores are available through the city's [Open Data initiative](https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i/data),\n",
    "called [DataSF](https://datasf.org). DataSF is one example of city governments around the world\n",
    "making their data publicly available; the DataSF mission is to \"empower the use\n",
    "of data in decision making and service delivery\" with the goal of improving the\n",
    "quality of life and work for residents, employers, employees, and visitors.\n",
    "\n",
    "San Francisco requires restaurants to publicly display their scores\n",
    "(see {numref}`Figure %s <scoreCard>` for an example placard).[^CARDS] These data offer an example of multiple files with different structures, fields, and granularity. One dataset contains summary results of inspections, another\n",
    "provides details about the violations found, and a third\n",
    "contains general information about the restaurants. The violations include both serious\n",
    "problems related to the transmission of foodborne illnesses and minor issues such as not properly displaying the\n",
    "inspection placard.  \n",
    "\n",
    "[^CARDS]:In 2020, the city began giving restaurants color-coded placards indicating whether the restaurant passed (green), conditionally passed (yellow), or failed (red) the inspection. These new placards no longer display a numeric inspection score. However, a restaurant's scores and violations are still available at DataSF."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} figures/scoreCardSmall.png\n",
    "---\n",
    "name: scoreCard\n",
    "height: 200px\n",
    "---\n",
    "\n",
    "A food safety scorecard displayed in a restaurant; scores range between 0 and 100\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both the DAWN survey data and the San Francisco restaurant inspection data are available online as plain-text files. However, their formats are quite different, and in the next section, we demonstrate how to figure out a file format so that we can read the data into a dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
